Abstract

We consider the problem of constructing deletion correcting codes over a binary alphabet and take a graph theoretic view.

An n-bit s-deletion correcting code is an independent set in a particular graph. We propose constructing such a code by taking 
the union of many constant Hamming weight codes. This results in codes that have additional structure. Searching for codes 
in constant Hamming weight induced subgraphs is computationally easier than searching the original graph. We prove a lower
bound on the size of a codebook constructed this way for any number of deletions and show that it is only a small factor below the
corresponding lower bound on unrestricted codes. In the single deletion case, we find optimal colorings of the constant Hamming 
weight induced subgraphs. We show that the resulting code is asymptotically optimal. We discuss the relationship between codes 
and colorings and observe that the VT codes are optimal in a coloring sense. We prove a new lower bound on the chromatic
number of the deletion channel graphs. Colorings of the deletion channel graphs that match this bound do not necessarily produce
asymptotically optimal codes. 

I. Introduction 

DELETION channels output only a subsequence of their input while preserving the order of the transmitted symbols. 
They have applications in biology, synchronization problems, and communication of information over packet networks. 
This paper concerns channels that take a binary input string of fixed length and delete a fixed number of symbols. Despite
significant effort on this case, there still are many fundamental open problems. In particular, we are interested in the design
of s-deletion correcting codes and the cardinality of the largest possible codebook.

Levenshtein gave partial answers to both problems. He derived asymptotic upper and lower bounds on the sizes of codes
for any number of deletions [2]. He showed that the Varshamov Tenengolts (VT) codes, which had been designed to correct a
single asymmetric error [3], [4], could be used to correct a single deletion. The VT codes meet the upper bound, so they are
asymptotically optimal and establish the capacity of the single deletion channel.
This paper addresses two questions related to code construction by taking a graph theoretic perspective. For each input string
length and number of deletions, there is a graph that expresses all of the constraints on code construction. The vertices of this
graph correspond to the binary strings of that length and a code is an independent set in the graph. The problem of finding a
maximum independent set is NP-hard for general graphs.

First, we present a two stage method for code construction. The method involves partitioning the vertices of the graph 
according to Hamming weight, finding codes in selected partitions, and taking the union of these codes. The strings
of a particular weight induce a subgraph. Independent sets in this subgraph can be found in various ways, in particular, by
exhaustive search, greedy search, or explicit graph coloring. Finding good codes in the subgraphs is less computationally
intensive than exhaustively searching within the whole graph. For any number of deletions we prove a lower bound on the 
size of codes constructed using these subgraphs. This bound is within a small constant factor of the Levenshtein lower bound. 
This demonstrates that adding this restriction on codeword weights requires us to pay only a small penalty in the code sizes 
that we can guarantee. In the single deletion case, we use this method to construct new asymptotically optimal codes. These 
use an optimal coloring of the constant weight subgraphs. 

Second, having taken a graph theoretic perspective, we ask if the existing codes of Varshamov and Tenengolts have a graph
interpretation. We observe that VT codes are optimal colorings of the whole single deletion graphs. Any sequence of optimal 
colorings of the single deletion graphs produces sequences of codes that match Levenshtein's upper bound. We show that 
the same is not true for the multiple deletion graphs by deriving a lower bound on the chromatic number of the graphs for 
each string length and number of deletions. Even if there are sequences of colorings using the number of colors specified 
by the bound, the corresponding sequences of codes are not guaranteed to match the Levenshtein upper bound. Consequently,
either solving the coloring problem for multiple deletions is not sufficient for finding asymptotically optimal independent sets,
or Levenshtein's upper bound on independent set size is not tight.

A. Related Work 

A wide variety of code constructions have been proposed for the deletion channel and other closely related channels. These 
constructions vary significantly in code size, explicitness, and efficiency of construction so all comparisons must be done 
carefully. Tenengolts found an asymptotic upper bound on single deletion correcting codes over nonbinary alphabets. He 
constructed codes over each q-ary alphabet that are within a factor of of the bound (5). Helberg and Ferreira attempted to 
generalize the VT construction to any number of deletions, but the sizes of the resulting codes are far below Levenshtein's lower
bound [6]. Schulman and Zuckerman considered a different asymptotic regime. They constructed nonexplicit but efficiently
constructible codes for a channel that deletes a constant fraction of the symbols in each block [7].

Another direction for the construction of codes is computational. It is well known that the problem of finding deletion 
correcting codes is equivalent to finding an independent set in a particular graph [8]. But since, for general graphs, finding
the maximum independent set is NP-hard, exact algorithms rapidly become intractable with increasing input string length (n). 
Codes found via search usually lack structure and efficient decoding algorithms, but they are still interesting because they 
establish lower bounds on the size of optimal codes. For the case of the single deletion, the computational approach has 
established that VT codes are optimal for $n \le 10$ (a graph with $2^{10}$ vertices) [9]. For multiple deletions, the best known codes
have all been found through search algorithms. Butenko et al. found two-deletion correcting codes of maximum size for n < 10
[10]. Khajouei et al. used a heuristic algorithm to find the largest known two deletion correcting codes for n < 25 [11].

There has been much work on constructions, which provide lower bounds, but progress on upper bounds has been rare. Levenshtein
eventually refined his original asymptotic bound (and the parallel nonbinary bound of Tenengolts) into a nonasymptotic
version [12]. Kulkarni and Kiyavash recently proved a better upper bound for an arbitrary number of deletions and any alphabet
size [13].

There are several other lines of work attacking related combinatorial problems. One of these involves characterizing the sets 
of superstrings and substrings of any string. Levenshtein showed that the number of superstrings does not depend on the starting
string [14]. Calabi and Hartnett gave a tight bound on the number of substrings of each length [15]. Hirschberg extended the
bound to larger alphabets [16]. Swart and Ferreira gave a formula for the number of distinct substrings produced by two
deletions for any starting string [17]. Liron and Langberg improved and unified existing bounds and constructed tightness
examples [18].

B. Organization 

The paper is organized as follows. In Section II we give some notation and definitions related to the deletion channel and
review the graph theoretic terminology and results. In Section III we describe our code construction strategy and prove lower
bounds on the sizes of the codes for any number of deletions. In Section IV we construct new asymptotically optimal single
deletion correcting codes and show that colorings used in the VT codes and in our codes are both optimal. In Section V
we discuss the relationship between optimal colorings and optimal independent sets for multiple deletion graphs and prove a
lower bound on the number of colors needed for these graphs. Proofs of some technical results are found in two appendices.
In Appendix A we compute the weight distribution of the superstrings of a given string. In Appendix C we identify various
induced subgraphs in these graphs, demonstrating that the graphs are not perfect.

II. Preliminaries 

A. Notation 

Let $[n]$ be the set of nonnegative integers less than $n$, $\{0, 1, \ldots, n-1\}$. Let $[2]^n$ be the set of binary strings of length $n$. Let $[2]^n_k$
be the set of binary strings of length $n$ with exactly $k$ ones. Let $H(x)$ be the Hamming weight of a string $x$. We will need the
following asymptotic notation: let $a(n) \sim b(n)$ denote that $\lim_{n\to\infty} a(n)/b(n) = 1$ and $a(n) \lesssim b(n)$ denote that $\lim_{n\to\infty} a(n)/b(n) \le 1$.

We will use the following asymptotic equality frequently: for fixed $c$, $\binom{n}{c} \sim \frac{n^c}{c!}$.

B. The deletion channel and associated graphs 

We will formalize the problem of correcting deletions by defining the deletion channel. The deletion channel takes a binary 
string of length n and outputs a substring of length n — s. For binary strings x and y, write x < y if x is a substring of y and 
define the following sets. 

Definition 1. For $x \in [2]^n$, define

$$D_s(x) = \{z \in [2]^{n-s} \mid z < x\},$$

the set of substrings of $x$ that can be produced by $s$ deletions. Similarly

$$I_s(x) = \{w \in [2]^{n+s} \mid w > x\},$$

the set of superstrings of $x$ that can be produced by $s$ insertions.

If $x$ is the input to an $n$ bit $s$ deletion channel, $D_s(x)$ is the set of possible outputs. If $x$ is the output from the channel,
$I_s(x)$ is the set of possible inputs.

When two inputs share common outputs they can potentially be confused by the receiver. 

Definition 2. For any two strings $x, y \in [2]^n$, define

$$D_s(x,y) = D_s(x) \cap D_s(y),$$

the set of common substrings of length $n - s$. For any $x \in [2]^n$, define

$$N_s(x) = \{y \in [2]^n \setminus \{x\} \mid D_s(x,y) \ne \emptyset\},$$

the set of strings that share a common substring of length $n - s$ with $x$.

We are interested in codes that allow the correction of s deletions. 

Definition 3. A length $n$ $s$-deletion correcting code is a set $C \subseteq [2]^n$ such that for any two distinct binary strings $x, y \in C$,
$D_s(x,y)$ is empty. A length $n$ $s$-deletion correcting code is optimal if no larger code exists for those parameters. A sequence
of $s$-deletion correcting codes with increasing $n$ is asymptotically optimal if the sequence of ratios of their sizes to the
optimal sizes goes to one.

We can also characterize codes by defining a distance measure on binary strings. 

Definition 4. Let $x \in [2]^m$ and $y \in [2]^n$ and let $z \in [2]^l$ be a common substring of $x$ and $y$ of maximum length. Then $x$ can
be transformed into $z$ by $m - l$ deletion operations and $z$ can be transformed into $y$ by $n - l$ insertion operations. Thus the
deletion distance between $x$ and $y$ is $d_L(x,y) = m + n - 2l$.

It is well known that deletion distance is a metric Q. If x and y are the same length, then the deletion distance between 
them is even and 

$$d_L(x,y)/2 = \min\{s \in \mathbb{N} \mid D_s(x,y) \ne \emptyset\}.$$

Now we have a metric characterization of an $s$-deletion correcting code: a set of codewords of length $n$ in which the deletion
distance between any two codewords is greater than $2s$. Two codewords cannot both appear in a code if their deletion distance
is $2s$ or less. We capture this condition by defining the following graph.

Definition 5. For all $s, n \in \mathbb{N}$, let $L_{s,n}$ be a graph with $[2]^n$ as its vertices. Vertices $x$ and $y$ are adjacent if and only if

$$d_L(x,y)/2 \le s.$$

Finally, we have a graphical characterization of an $s$-deletion correcting code: a set of vertices in $L_{s,n}$ that have no edges
between them.
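
For small $n$ the definitions above can be checked by brute force. The following is a minimal sketch rather than part of the construction itself: the helper names `deletions`, `lcs_length`, `deletion_distance`, and `is_deletion_code` are hypothetical, and binary strings are assumed to be represented as Python `str` objects over the characters `0` and `1`.

```python
from itertools import combinations


def deletions(x, s):
    """D_s(x): the distinct strings obtained by deleting s symbols from x."""
    n = len(x)
    return {"".join(x[i] for i in keep) for keep in combinations(range(n), n - s)}


def lcs_length(x, y):
    """Length of a longest common substring (subsequence) of x and y."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def deletion_distance(x, y):
    """d_L(x, y) = m + n - 2l, with l the length of a longest common substring."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


def is_deletion_code(codewords, s):
    """True if no two distinct codewords share a substring after s deletions."""
    return all(
        deletions(x, s).isdisjoint(deletions(y, s))
        for x, y in combinations(codewords, 2)
    )
```

Two distinct strings of equal length are adjacent in $L_{s,n}$ exactly when `deletion_distance(x, y) // 2 <= s`, which agrees with testing whether their `deletions` sets intersect.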

C. Independent Sets, Colorings, and Cliques 

Now we will briefly define some graph notation and review a few concepts that will be useful later. All of these are sourced
from West [19]. Given a graph $G$, let $V(G)$ denote its vertex set and let $E(G)$ denote its edge set. Given $S \subseteq V(G)$, the
subgraph induced by $S$ contains the vertices in $S$ and the edges in $E(G)$ that have both endpoints in $S$.

An independent set in a graph is a set of vertices that are all nonadjacent. The size of a largest independent set in a graph
$G$ is denoted by $\alpha(G)$. The neighborhood of a vertex is the set of adjacent vertices. The degree of a vertex is the number
of adjacent vertices. The maximum degree of any vertex in $G$ is denoted by $\Delta(G)$. Every maximal independent set contains
at least $|V(G)|/(\Delta(G) + 1)$ vertices. This is because the union of the neighborhoods of the vertices in the independent set,
together with the set itself, must contain all of the vertices in the graph. The average degree of the vertices of $G$ is denoted by $d(G)$. Because each edge
contributes to the degree of two vertices, $d(G) = 2|E(G)|/|V(G)|$. Some independent set containing at least $|V(G)|/(d(G) + 1)$
vertices always exists [19, p. 122]. This result is a version of Turan's Theorem.
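
The average degree bound is constructive. As a hedged illustration (not taken from the paper), the sketch below implements the standard minimum-degree greedy heuristic on a graph given as an adjacency dictionary; the function names `greedy_independent_set` and `turan_bound` are hypothetical. The set it returns is known to contain at least $|V(G)|/(d(G)+1)$ vertices.

```python
def greedy_independent_set(adj):
    """Minimum-degree greedy: repeatedly pick a vertex of least remaining degree,
    keep it, and delete it together with its neighbors."""
    remaining = {v: set(nbrs) for v, nbrs in adj.items()}
    chosen = []
    while remaining:
        v = min(remaining, key=lambda u: len(remaining[u]))
        chosen.append(v)
        removed = remaining[v] | {v}
        for u in removed:
            remaining.pop(u, None)
        for nbrs in remaining.values():
            nbrs -= removed  # in-place update of each surviving neighborhood
    return chosen


def turan_bound(adj):
    """|V| / (d + 1), where d = 2|E| / |V| is the average degree."""
    n = len(adj)
    average_degree = sum(len(nbrs) for nbrs in adj.values()) / n
    return n / (average_degree + 1)
```

For example, on a 5-cycle the bound is $5/3$ and the greedy procedure returns two vertices, which is also the size of a maximum independent set.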

A $k$-coloring of a graph assigns a color (an element of $[k]$) to each vertex. The coloring is proper if it never assigns the
same color to both endpoints of an edge. Thus a proper coloring of a graph partitions its vertices into independent sets; each
independent set is assigned a single color and called a color class. The chromatic number of a graph $G$, denoted $\chi(G)$, is the
smallest $k$ for which a proper $k$-coloring of $G$ exists. An argument based on greedy coloring of $G$ shows that $\chi(G) \le \Delta(G) + 1$.

A coloring gives us several independent sets to choose from, each corresponding to a color class. At least one of these color
classes must be at least as large as the average size of a color class. Consequently, $\alpha(G) \ge |V(G)|/\chi(G)$. However, properly
coloring a graph using the minimum number of colors is not equivalent to finding the largest independent set. In general there
is no guarantee that the largest color class in a particular coloring is a maximum independent set or that any optimal coloring
has a maximum independent set as a color class.

Fig. 1: Inequalities between graph parameters: $\omega(G^2) \ge \Delta(G) + 1 \ge \chi(G) \ge \omega(G)$.

A clique in a graph is a set of vertices that are all adjacent. The size of a largest clique in a graph $G$ is denoted by $\omega(G)$.
In a proper coloring, each vertex in a clique must be assigned a different color, so for any graph $G$, $\chi(G) \ge \omega(G)$.

For any graph $G$, we can define its $t$th power, denoted $G^t$. The vertex sets of $G$ and $G^t$ are the same. Vertices are adjacent
in $G^t$ if and only if there is a path between them in $G$ of $t$ or fewer edges. Any vertex of $G$ together with its neighborhood forms a clique
in $G^2$, so $\omega(G^2) \ge \Delta(G) + 1$.

Deletion distance satisfies the triangle inequality, so the length of the shortest path between vertices $x$ and $y$ in $L_{s,n}$ is at
least $d_L(x,y)/2s$. This implies that if $x$ and $y$ are adjacent in $(L_{s,n})^t$, then $d_L(x,y)/2 \le ts$. Thus every edge in $(L_{s,n})^t$ is
present in $L_{ts,n}$ and we have $\omega(L_{2s,n}) \ge \omega((L_{s,n})^2) \ge \Delta(L_{s,n}) + 1$.

These inequalities are summarized in Fig. 1.

D. Existing results 

Now that we have established some terminology and notation, we can concisely express some important existing results.
Levenshtein proved the following asymptotic upper and lower bounds on the size of optimal $s$-deletion correcting codes [2]:

$$\frac{2^{n+s}}{\binom{n}{s}^2} \lesssim \alpha(L_{s,n}) \lesssim \frac{2^n}{\binom{n}{s}}. \qquad (1)$$

We give a proof of the lower bound in Section III-B. Notice that there is a gap between the upper and lower bounds for all
numbers of deletions.

For a single deletion, the VT construction asymptotically matches the upper bound and closes the gap. The VT construction
uses a weight function to partition $[2]^n$ into $n + 1$ sets. Levenshtein showed that each of these sets is a code [2], so each is
an independent set in $L_{1,n}$. The largest VT code (corresponding to VT weight zero) always contains at least $\frac{2^n}{n+1}$ codewords.
This matches the asymptotic upper bound, so $\alpha(L_{1,n}) \sim \frac{2^n}{n}$. The largest of these codes is conjectured to be optimal, i.e., it is
conjectured to solve the maximum independent set problem on $L_{1,n}$ [8]. Kulkarni and Kiyavash [13] show that these codes
are within a factor of at most $\frac{n+1}{n-1}$ of the largest for string length $n$.

Levenshtein also showed that the number of distinct superstrings of a string produced by $s$ insertions only depends on the
length of the string [14]. For each $x \in [2]^{n-s}$,

$$|I_s(x)| = I_{s,n}, \qquad (2)$$

where

$$I_{s,n} = \sum_{i=0}^{s} \binom{n}{i}.$$

For fixed $s$, this implies

$$I_{s,n} \sim \binom{n}{s}. \qquad (3)$$

Calabi gave an upper bound on the number of substrings produced by $s$ deletions [15]. For each $x \in [2]^{n+s}$,

$$|D_s(x)| \le I_{s,n}. \qquad (4)$$

For any fixed length, only the two strings of alternating zeros and ones meet this bound with equality.
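
Both counting results are easy to confirm by exhaustive enumeration for short strings. The sketch below is illustrative only; the helper names `I_sn`, `is_subsequence`, `superstrings`, and `substrings` are hypothetical, and strings are assumed to be Python `str` objects over `0` and `1`.

```python
from itertools import combinations, product
from math import comb


def I_sn(s, n):
    """I_{s,n} = sum_{i=0}^{s} C(n, i)."""
    return sum(comb(n, i) for i in range(s + 1))


def is_subsequence(z, x):
    """True if z can be obtained from x by deleting symbols."""
    it = iter(x)
    return all(c in it for c in z)


def superstrings(x, s):
    """I_s(x): binary strings of length len(x) + s that contain x."""
    n = len(x) + s
    words = ("".join(w) for w in product("01", repeat=n))
    return {w for w in words if is_subsequence(x, w)}


def substrings(x, s):
    """D_s(x): distinct results of deleting s symbols from x."""
    n = len(x)
    return {"".join(x[i] for i in keep) for keep in combinations(range(n), n - s)}
```

With these helpers, `len(superstrings(x, s))` equals `I_sn(s, len(x) + s)` for every short `x`, while `len(substrings(x, s))` never exceeds `I_sn(s, len(x) - s)` and reaches it for an alternating string such as `0101`.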

III. Code construction by weight partitioning 

We now describe a strategy for code construction for any number of deletions. This strategy is inspired by a simple bound 
on deletion distance. 

Lemma 1. For all strings $x, y \in [2]^n$, the deletion distance between them satisfies the lower bound $d_L(x,y)/2 \ge |H(x) - H(y)|$.



Fig. 2: $L_{1,4}$ partitioned by Hamming weight into the layers $L_{1,4,0}$, $L_{1,4,1}$, $L_{1,4,2}$, $L_{1,4,3}$, $L_{1,4,4}$. An independent set in each even weight layer is highlighted.



Proof: Let $z \in [2]^l$ be a longest common substring of $x$ and $y$. Then $z$ has at most as many ones as either $x$ or $y$, so

$$H(z) \le \min(H(x), H(y)).$$

It must also have at most as many zeros, so

$$l - H(z) \le \min(n - H(x), n - H(y)).$$

Combining these yields

$$n - l \ge \max(H(x), H(y)) - \min(H(x), H(y)).$$

The deletion distance is $2(n - l)$, so the claim follows. ■
Let $L_{s,n,k}$ be the subgraph of $L_{s,n}$ induced by the vertices with exactly $k$ ones. By Lemma 1, the endpoints of any edge in $L_{s,n}$ differ in
Hamming weight by at most $s$. Suppose we find an independent set composed entirely of vertices of Hamming weight $k$, i.e.
an independent set in $L_{s,n,k}$, and another independent set entirely of vertices of weight $k + s + 1$. Then we can guarantee that their
union is an independent set in $L_{s,n}$. We can add another independent set in $L_{s,n,k+2(s+1)}$ and continue until we have
exhausted the weights that are equal to $k \bmod s + 1$. This procedure gives us an independent set in $L_{s,n}$. Fig. 2 illustrates this
for $L_{1,4}$.

More formally, we have the following result. 

Lemma 2. For each possible remainder $0 \le a \le s$, the constant weight strategy produces an $s$-deletion correcting code with
at least

$$\sum_{\substack{0 \le k \le n \\ k \equiv a \bmod s+1}} \alpha(L_{s,n,k})$$

codewords.

Another way to describe this process is that we start by throwing out all the vertices whose Hamming weights do not
equal $a \bmod s + 1$. The remaining graph contains about $\frac{1}{s+1}$ of the original vertices and it is disconnected. The maximum
independent set in this graph is the union of the maximum independent sets from each connected component.
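
The constant weight strategy can be sketched in a few lines. The code below is an illustration under stated assumptions, not the paper's algorithm: within each weight layer it uses a simple lexicographic greedy search (one of the options mentioned in the introduction), and the names `greedy_layer_code` and `constant_weight_code` are hypothetical.

```python
from itertools import combinations, product


def deletions(x, s):
    """D_s(x): distinct strings obtained by deleting s symbols from x."""
    n = len(x)
    return {"".join(x[i] for i in keep) for keep in combinations(range(n), n - s)}


def greedy_layer_code(n, k, s):
    """Greedy independent set in L_{s,n,k}, scanning weight-k strings in lexicographic order."""
    code, used = [], set()
    for bits in product("01", repeat=n):
        x = "".join(bits)
        if x.count("1") != k:
            continue
        d = deletions(x, s)
        if used.isdisjoint(d):  # x conflicts with no earlier codeword of this layer
            code.append(x)
            used |= d
    return code


def constant_weight_code(n, s, a):
    """Union of layer codes over the weights k with k = a mod (s + 1)."""
    code = []
    for k in range(a, n + 1, s + 1):
        code.extend(greedy_layer_code(n, k, s))
    return code


def is_deletion_code(codewords, s):
    """Brute-force check that no two codewords share a substring after s deletions."""
    return all(
        deletions(x, s).isdisjoint(deletions(y, s))
        for x, y in combinations(codewords, 2)
    )
```

No check is performed across layers: Lemma 1 already guarantees that codewords whose weights differ by at least $s + 1$ cannot be confused, and `is_deletion_code` is included only to confirm this independently.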

We have described how to build an independent set in $L_{s,n}$ out of independent sets in the constant weight subgraphs. We
can build a coloring of $L_{s,n}$ out of colorings of the constant weight subgraphs.

Lemma 3. Suppose that for each $k \in \mathbb{N}$ with $0 \le k \le n$, there is some proper $c_k$-coloring of $L_{s,n,k}$, $f_k : [2]^n_k \to [c_k]$. Then the coloring
function

$$g : [2]^n \to [s + 1] \times [\max_k c_k], \qquad x \mapsto (H(x) \bmod s + 1, \; f_{H(x)}(x))$$

is a proper coloring of $L_{s,n}$.

Proof: Let $x$ and $y$ be adjacent vertices in $L_{s,n}$. For $g$ to be a proper coloring, it must assign them different colors. If
$H(x) = H(y)$, then $f_{H(x)}(x) \ne f_{H(x)}(y)$. From Lemma 1, $|H(x) - H(y)| \le s$ so if $H(x) \ne H(y)$, then $H(x) \not\equiv H(y)$
$\bmod\ s+1$. ■
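
As a hedged illustration of Lemma 3, the sketch below colors each constant weight layer greedily (any proper layer coloring would do) and then combines the layer colorings into the pair-valued function $g$. The helper names are hypothetical and the brute-force adjacency test is only practical for small $n$.

```python
from itertools import combinations, product


def adjacent(x, y, s):
    """Adjacency in L_{s,n}: distinct strings sharing a substring after s deletions."""
    n = len(x)
    dx = {"".join(x[i] for i in keep) for keep in combinations(range(n), n - s)}
    dy = {"".join(y[i] for i in keep) for keep in combinations(range(n), n - s)}
    return x != y and not dx.isdisjoint(dy)


def greedy_layer_coloring(layer, s):
    """A proper coloring f_k of one constant weight layer: smallest free color first."""
    color = {}
    for x in layer:
        taken = {color[y] for y in color if adjacent(x, y, s)}
        color[x] = next(c for c in range(len(layer)) if c not in taken)
    return color


def combined_coloring(n, s):
    """g(x) = (H(x) mod (s + 1), f_{H(x)}(x)), as in Lemma 3."""
    g = {}
    for k in range(n + 1):
        layer = ["".join(b) for b in product("01", repeat=n) if b.count("1") == k]
        f_k = greedy_layer_coloring(layer, s)
        for x in layer:
            g[x] = (k % (s + 1), f_k[x])
    return g
```

Checking every adjacent pair of $L_{s,n}$ against the dictionary returned by `combined_coloring` confirms that the combined coloring is proper even though each layer was colored without looking at the others.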

A. Upper Bounds on Maximum and Average Degree 

The strategy outlined above reduces the problem of finding an independent set in $L_{s,n}$ to the problem of finding independent
sets in each of $L_{s,n,k}$, for $0 \le k \le n$. We would like to know how the sizes of codebooks produced by the constant weight
approach compare to unrestricted codes. To make this comparison we will apply the same lower bounding technique to both
types of codes.

Recall that $\alpha(G) \ge |V(G)|/(d(G) + 1)$ where $d(G) = \frac{2|E(G)|}{|V(G)|}$ is the average degree of $G$. This translates an upper bound on
average degree into a lower bound on maximum code size. We will apply this bound to both $L_{s,n}$ and $L_{s,n,k}$. In the case of
$L_{s,n}$, we will deduce Levenshtein's original lower bound on code size.

The computation of the average degree of $L_{s,n}$ is simpler so we tackle it first. A very similar argument applies to computing
the degree of a single specified vertex so we present the two together.

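Lemma 4. For all $s, n \in \mathbb{N}$ with $s \le n$, the average degree and maximum degree in $L_{s,n}$ satisfy

$$d(L_{s,n}) \le 2^{-s} I_{s,n}(I_{s,n} - 1), \qquad d(L_{s,n}) \lesssim 2^{-s}\binom{n}{s}^2,$$

$$\Delta(L_{s,n}) \le I_{s,n-s}(I_{s,n} - 1), \qquad \Delta(L_{s,n}) \lesssim \binom{n}{s}^2.$$

The asymptotic bounds are for fixed $s$.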

Proof: Vertices $x$ and $y$ are adjacent if and only if $|D_s(x,y)| \ge 1$. Thus in the whole graph we have

$$|E(L_{s,n})| = \sum_{\{x,y\} \subseteq [2]^n,\ x \ne y} \min(|D_s(x,y)|, 1).$$

We can count the triples $x, y \in [2]^n$, $z \in [2]^{n-s}$ such that $x > z$ and $y > z$ in two ways. On the left we sum over $x$ and $y$
and on the right we sum over $z$:

$$\sum_{\{x,y\} \subseteq [2]^n,\ x \ne y} |D_s(x,y)| = \sum_{z \in [2]^{n-s}} \binom{|I_s(z)|}{2}.$$

Recall that $|I_s(z)|$ is a constant equal to $I_{s,n}$ from (2). The average degree is given by $d(L_{s,n}) = \frac{2|E(L_{s,n})|}{|V(L_{s,n})|}$, so

$$d(L_{s,n}) \le \frac{2 \cdot 2^{n-s}\binom{I_{s,n}}{2}}{2^n} = 2^{-s} I_{s,n}(I_{s,n} - 1) \sim 2^{-s}\binom{n}{s}^2.$$

To prove the bounds on maximum degree, we consider the neighborhood of a vertex instead of the entire graph. We have

$$|N_s(x)| = \sum_{y \in [2]^n \setminus \{x\}} \min(|D_s(x,y)|, 1)$$

and

$$\sum_{y \in [2]^n \setminus \{x\}} |D_s(x,y)| = \sum_{z \in D_s(x)} (|I_s(z)| - 1) = |D_s(x)|(I_{s,n} - 1) \le I_{s,n-s}(I_{s,n} - 1).$$

The inequality follows from (4) in Section II-D. Thus the maximum degree satisfies

$$\Delta(L_{s,n}) \le I_{s,n-s}(I_{s,n} - 1) \sim \binom{n}{s}^2.$$
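
The nonasymptotic bounds of Lemma 4 can be compared with the true degrees for small parameters. The following sketch is an illustration only; the names `I_sn`, `degrees`, and `check_lemma_4` are hypothetical, and the enumeration is exponential in $n$.

```python
from itertools import combinations, product
from math import comb


def I_sn(s, n):
    """I_{s,n} = sum_{i=0}^{s} C(n, i)."""
    return sum(comb(n, i) for i in range(s + 1))


def degrees(n, s):
    """Degree of every vertex of L_{s,n}, computed by brute force."""
    words = ["".join(b) for b in product("01", repeat=n)]
    dels = {
        x: {"".join(x[i] for i in keep) for keep in combinations(range(n), n - s)}
        for x in words
    }
    return {
        x: sum(1 for y in words if y != x and not dels[x].isdisjoint(dels[y]))
        for x in words
    }


def check_lemma_4(n, s):
    """Return (average degree, its bound, maximum degree, its bound)."""
    deg = degrees(n, s)
    average = sum(deg.values()) / len(deg)
    average_bound = I_sn(s, n) * (I_sn(s, n) - 1) / 2**s
    max_bound = I_sn(s, n - s) * (I_sn(s, n) - 1)
    return average, average_bound, max(deg.values()), max_bound
```

For instance, `check_lemma_4(6, 1)` returns the exact average and maximum degree of $L_{1,6}$ next to the bounds $2^{-1} \cdot 7 \cdot 6 = 21$ and $6 \cdot 6 = 36$.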

Levenshtein's original lower bound follows immediately. 
Theorem 1. For all $s, n \in \mathbb{N}$, there exist codebooks of size at least $\frac{2^{n+s}}{I_{s,n}(I_{s,n} - 1) + 2^s}$. For fixed $s$, their size is asymptotically at
least $\frac{2^{n+s}}{\binom{n}{s}^2}$.

Proof: Codes are independent sets in $L_{s,n}$. Substituting the upper bound on $d(L_{s,n})$ of Lemma 4 into Turan's theorem,
$\alpha(G) \ge |V(G)|/(d(G) + 1)$, gives the result. ■
Levenshtein's original proof of the asymptotic version of this result used a different argument [2]. He later proved the
nonasymptotic version using what appears to be the same argument that we make here [12].
B. Lower Bounds on Sizes of Codes from the Constant Weight Strategy

Now we extend this argument to the constant weight strategy. We used the total number of superstrings of a string to bound
the average degree of $L_{s,n}$, and we will use the number of superstrings of a given weight to bound the average degree of
$L_{s,n,k}$. This will translate into a bound on the size of independent sets in $L_{s,n,k}$ and independent sets in $L_{s,n}$ with our weight
restriction. We need some additional notation.

Definition 6. For $x \in [2]^n_k$, let

$$I_{(s,r)}(x) = \{w \in [2]^{n+s}_{k+r} \mid w > x\}.$$

This is the set of superstrings of $x$ with length $n + s$ and weight $k + r$, the superstrings produced by inserting $r$ ones and
$s - r$ zeros.

Just as the size of $I_s(x)$ only depends on the length of $x$, the size of $I_{(s,r)}(x)$ only depends on the length and weight of $x$.

Lemma 5. For all $n, k, s, r \in \mathbb{N}$ with $0 \le r \le s \le n$ and $0 \le k \le n$, and all $x \in [2]^{n-s}_{k-r}$, the number of superstrings of $x$
with length $n$ and weight $k$ satisfies $|I_{(s,r)}(x)| = \sum_{i=0}^{\min(r,s-r)} \binom{k+s-2r}{s-r-i}\binom{n-k-s+2r}{r-i}$.

The proof is quite involved; it requires a new representation of the elements of $I_{(s,r)}(\cdot)$ in terms of multisets. So as to not
hinder the flow of our results, we have included it in Appendix A.
We will name this constant:

$$I_{(s,r),(n,k)} = \sum_{i=0}^{\min(r,s-r)} \binom{k+s-2r}{s-r-i}\binom{n-k-s+2r}{r-i}. \qquad (5)$$
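
The closed form for the number of superstrings of a given weight can be compared with direct enumeration. This is a sanity-check sketch under the reading of the formula as a sum over $0 \le i \le \min(r, s-r)$; the function names `I_sr`, `is_subsequence`, and `count_superstrings` are hypothetical.

```python
from itertools import product
from math import comb


def I_sr(s, r, n, k):
    """Closed form for the number of weight-k, length-n superstrings of a
    string of length n - s and weight k - r."""
    return sum(
        comb(k + s - 2 * r, s - r - i) * comb(n - k - s + 2 * r, r - i)
        for i in range(min(r, s - r) + 1)
    )


def is_subsequence(z, x):
    """True if z can be obtained from x by deleting symbols."""
    it = iter(x)
    return all(c in it for c in z)


def count_superstrings(x, s, r):
    """Brute-force count of superstrings of x with s more symbols and r more ones."""
    n, k = len(x) + s, x.count("1") + r
    words = ("".join(w) for w in product("01", repeat=n))
    return sum(1 for w in words if w.count("1") == k and is_subsequence(x, w))
```

For the single deletion case the formula reduces to `I_sr(1, 0, n, k) == k + 1` and `I_sr(1, 1, n, k) == n - k + 1`, the two clique sizes used later for the constant weight single deletion graphs.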

For the following lemma, we need the asymptotic value of this expression letting $k = pn$ with fixed $s$, $r$, and $p$. The $i = 0$
term of the sum is a degree $s$ polynomial and all other terms are of lower degree. Thus we have

$$I_{(s,r),(n,pn)} \sim \binom{pn+s-2r}{s-r}\binom{n-pn-s+2r}{r} \sim \binom{pn}{s-r}\binom{n-pn}{r} \sim \frac{(pn)^{s-r}(n-pn)^r}{(s-r)!\,r!} \sim \binom{n}{s}\binom{s}{r}p^{s-r}(1-p)^r. \qquad (6)$$

We will use the following lemma in our computation of the average degree of the constant weight subgraphs.

Lemma 6. For all $s \in \mathbb{N}$,

$$f_s(p) = \sum_{0 \le r \le s} \binom{s}{r}^2 p^{s-r}(1-p)^r$$

is maximized at $p = 1/2$, so for all $p$, $f_s(p) \le 2^{-s}\binom{2s}{s}$.

The proof is in Appendix B. The following lemma gives the average degree of $L_{s,n,k}$.

Lemma 7. Let $k = pn$. Then the average degree of the weight $k$ subgraph satisfies $d(L_{s,n,k}) \lesssim \frac{p^s(1-p)^s}{2^s}\binom{2s}{s}\binom{n}{s}^2$.

Proof: Let $x$ be a string of length $n - s$ and weight $k - r$ for some $0 \le r \le s$. Any two vertices in $I_{(s,r)}(x)$ are adjacent
in $L_{s,n,k}$. There are $\binom{n-s}{k-r}\binom{I_{(s,r),(n,k)}}{2}$ such pairs of vertices. The endpoints of each edge in $L_{s,n,k}$ have at least one common
substring of length $n - s$. The weight of this substring must be $k - r$ for some $0 \le r \le s$ because at most $s$ ones were deleted
from either endpoint to produce it. Thus every edge is counted at least once in the sum in

$$d(L_{s,n,k}) = \frac{2|E(L_{s,n,k})|}{\binom{n}{k}} \le \frac{2}{\binom{n}{k}} \sum_{0 \le r \le s} \binom{n-s}{k-r}\binom{I_{(s,r),(n,k)}}{2}. \qquad (7)$$

Recall that for fixed $a$, $\binom{x}{a} \sim \frac{x^a}{a!}$. The ratio of binomial coefficients simplifies asymptotically to

$$\frac{\binom{n-s}{k-r}}{\binom{n}{k}} = \frac{\binom{k}{r}\binom{n-k}{s-r}}{\binom{n}{s}\binom{s}{r}} \sim \frac{k^r (n-k)^{s-r}}{n^s} = p^r(1-p)^{s-r}. \qquad (8)$$

Substituting (6) and (8) into (7) gives an asymptotic upper bound on $d(L_{s,n,k})$ of

$$d(L_{s,n,k}) \lesssim \sum_{0 \le r \le s} p^r(1-p)^{s-r}\left(\binom{n}{s}\binom{s}{r}p^{s-r}(1-p)^r\right)^2 = p^s(1-p)^s\binom{n}{s}^2 \sum_{0 \le r \le s} \binom{s}{r}^2 p^{s-r}(1-p)^r.$$

Applying Lemma 6 gives the final bound,

$$d(L_{s,n,k}) \lesssim p^s(1-p)^s\binom{n}{s}^2 2^{-s}\binom{2s}{s}.$$



We can now use the upper bound on average degree to get a lower bound on code size. 

Theorem 2. For fixed $s$, $s$-deletion correcting codes produced by the constant weight strategy contain asymptotically at least

$$\frac{2^{n+3s}}{(s+1)\binom{2s}{s}\binom{n}{s}^2}$$

codewords.

Proof: From Lemma 2, there is a code with at least

$$\sum_{\substack{0 \le k \le n \\ k \equiv a \bmod s+1}} \alpha(L_{s,n,k}) \ge \frac{1}{s+1}\sum_{k=0}^{n} \alpha(L_{s,n,k})$$

codewords. The inequality holds because for some $a$ the resulting code is at least as large as the average. By Turan's Theorem,
$\alpha(L_{s,n,k}) \ge |V(L_{s,n,k})|/(d(L_{s,n,k}) + 1)$. Taking the result of Lemma 7 and applying $p(1 - p) \le 1/4$ gives $d(L_{s,n,k}) + 1 \lesssim
\binom{2s}{s}\binom{n}{s}^2 / 2^{3s}$. This bound does not depend on $k$, so using $\sum_{k=0}^{n} |V(L_{s,n,k})| = 2^n$ completes the proof. ■

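Corollary 1. The size of codebooks produced by the constant weight strategy is a factor of $\frac{(s+1)\binom{2s}{s}}{2^{2s}} \le \frac{s+1}{\sqrt{\pi s}}$ below the
Levenshtein lower bound.

Proof: The ratio is

$$\frac{2^{n+s}}{\binom{n}{s}^2} \Bigg/ \frac{2^{n+3s}}{(s+1)\binom{2s}{s}\binom{n}{s}^2} = \frac{(s+1)\binom{2s}{s}}{2^{2s}}.$$

From Stirling's approximation, $\binom{2s}{s} \le \frac{2^{2s}}{\sqrt{\pi s}}$. The result is immediate. ■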
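
To give a feel for the size of this penalty, the short sketch below evaluates the ratio of the two asymptotic lower bounds, taken here to be $(s+1)\binom{2s}{s}/2^{2s}$, together with the Stirling-type estimate $(s+1)/\sqrt{\pi s}$. The function names are hypothetical.

```python
from math import comb, pi, sqrt


def penalty_factor(s):
    """(s + 1) * C(2s, s) / 2^(2s): ratio of the two asymptotic lower bounds."""
    return (s + 1) * comb(2 * s, s) / 4**s


def stirling_estimate(s):
    """(s + 1) / sqrt(pi * s), obtained from C(2s, s) <= 4^s / sqrt(pi * s)."""
    return (s + 1) / sqrt(pi * s)


for s in range(1, 6):
    print(s, round(penalty_factor(s), 4), round(stirling_estimate(s), 4))
```

Under this reading the factor equals 1 for $s = 1$, so for a single deletion the weight restriction costs nothing in the guaranteed asymptotic size, and it equals $1.125$ for $s = 2$.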
C. Algorithms 

In this section we will compare the algorithms that produce optimal codes, codes promised by Turan's theorem, and explicit
codes. Computing the size of the largest independent set is NP-hard for general graphs. The best known exact algorithm that
requires only polynomial space uses $O(\mathrm{poly}(n) 2^{0.288n})$ time, where $n$ is the number of vertices [20].

Theorem 3. The ratio of the upper bound on the run time of the best exact algorithm on $L_{s,n}$ to the sum of upper bounds
for the run time on each of the graphs $L_{s,n,k}$ is $\Theta(\mathrm{poly}(n) 2^{0.288(1 - \sqrt{2/(\pi n)})2^n})$.

Proof: For each $n$ and $s$, there are only $n/2$ different graphs $L_{s,n,k}$, so running the algorithm on all of them takes at
most a linear factor longer than running the algorithm on the largest of them. The largest constant weight graph, $L_{s,n,n/2}$,
contains $\binom{n}{n/2}$ vertices. By Stirling's approximation this is asymptotically $\sqrt{\frac{2}{\pi n}} 2^n$. Thus the total run time is at most
$O(\mathrm{poly}(n) 2^{0.288\sqrt{2/(\pi n)} 2^n})$. The run time for $L_{s,n}$ is at most $O(\mathrm{poly}(n) 2^{0.288 \cdot 2^n})$. ■
However, the number of vertices in $L_{s,n,n/2}$ is still exponential in $n$ so exact algorithms quickly become infeasible.
There are many classes of graphs for which faster algorithms exist, but we have not found such a class that contains
$\{L_{s,n} \mid s, n \in \mathbb{N}\}$. One of the most general such classes is the class of perfect graphs.

Theorem 4. For all $s, n \in \mathbb{N}$ with $s \ge 1$ and $n \ge 3s + 1$, $L_{s,n}$ is not a perfect graph.

The proof is in Appendix C; it involves showing that there are odd cycles with no chords in $L_{s,n}$.

The independent sets promised by Turan's Theorem (i.e. by Theorem 2) can be found by a greedy algorithm using a minimum
degree heuristic [19]. Greedy codes can be generated in time polynomial in the number of vertices in the graph. Every
vertex in $L_{s,n}$ is in some $L_{s,n,k}$, so there is no time advantage to running a greedy algorithm on all of $L_{s,n,k}$ over running it
on $L_{s,n}$.
The number of vertices in $L_{s,n}$ is exponential in $n$, so even the greedy algorithms are slow. Because the independent sets
that we seek contain exponentially many vertices, listing the members of a set is slow regardless of the complexity of the
algorithm that we use to find the set. This difficulty leads to our interest in explicit codes, which satisfy an even stronger
algorithmic condition. To demonstrate the difference between a greedy code and an explicitly constructed code, consider an
independent set $S$ in $G$ as the indicator function $1_S : V(G) \to [2]$. In an explicit code, one can compute this function to
test membership in the code quickly and in small space. In contrast, to test membership in a greedy code one can store the set of
codewords and search, which requires space exponential in $n$, or regenerate the code, which requires time exponential in $n$.

A $k$-coloring of a graph $G$ is naturally thought of as a function $f : V(G) \to [k]$. An easy to compute coloring function leads
immediately to an easy to compute indicator function. In the following section we show an explicit construction of a single 
deletion correcting code using the constant weight approach. The weight condition together with a simple coloring function 
allow membership testing of a vertex in time and space linear in n. 

IV. Single deletion construction

In this section, we focus on the single deletion case ($s = 1$). We show an explicit construction of independent sets in the
graphs $L_{1,n,k}$. We construct these independent sets by finding a coloring of $L_{1,n,k}$. This coloring is closely related to the VT
codes in $L_{1,n}$. The code that results from our coloring is asymptotically optimal.

A. An explicit coloring of the constant weight single deletion graphs 

The VT construction uses a weight function to partition $[2]^n$ into $n+1$ codes, each an independent set in $L_{1,n}$. We observe
that this makes the VT weight a proper $(n + 1)$-coloring of $L_{1,n}$, so $\chi(L_{1,n}) \le n + 1$.

Both the VT coloring of $L_{1,n}$ and our colorings of $L_{1,n,k}$ are based on the following weight function.

Definition 7. For any $x \in [2]^n$, let $w(x) = \sum_{i=0}^{n-1} (i + 1)x_i$. Call $w(x) \bmod n + 1$ the VT weight. Let $f_k(x) = w(x) \bmod
(\max(k, n - k) + 1)$. We call $f_k$ the modified VT weight.

Levenshtein showed that for each string length $n$, the Varshamov-Tenengolts construction provides $n + 1$ distinct single
deletion correcting codes [2]. Restated in terms of graphs, the VT weight is a proper coloring of $L_{1,n}$.
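
The VT coloring is simple enough to enumerate directly for small $n$. The sketch below is illustrative; the names `vt_weight`, `vt_classes`, and `is_single_deletion_code` are hypothetical, and strings are indexed from zero as in Definition 7.

```python
from itertools import combinations, product


def vt_weight(x):
    """w(x) mod (n + 1), where w(x) = sum over i of (i + 1) * x_i."""
    n = len(x)
    return sum((i + 1) * int(b) for i, b in enumerate(x)) % (n + 1)


def vt_classes(n):
    """Partition [2]^n into n + 1 color classes by VT weight."""
    classes = {c: [] for c in range(n + 1)}
    for bits in product("01", repeat=n):
        x = "".join(bits)
        classes[vt_weight(x)].append(x)
    return classes


def is_single_deletion_code(codewords):
    """True if no two codewords share a substring after one deletion."""
    def dels(x):
        return {x[:i] + x[i + 1:] for i in range(len(x))}
    return all(dels(x).isdisjoint(dels(y)) for x, y in combinations(codewords, 2))
```

For $n = 4$ the class of VT weight zero is `['0000', '0110', '1001', '1111']`, and each of the five classes passes `is_single_deletion_code`.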

Lemma 8. The modified VT weight is a proper coloring of L\ n ^. 

Proof: Let $x$ and $y$ be adjacent vertices in $L_{1,n,k}$. We will show that $f_k(x) \neq f_k(y)$. Index the symbols in $x$ and $y$ by $[n]$, so $x = (x_0, \ldots, x_{n-1})$. For $S \subseteq [n]$, let $x_S$ indicate the substring of $x$ consisting of the symbols whose indices are in $S$. Note that $\sum_{i=0}^{n-1} y_i = \sum_{i=0}^{n-1} x_i = k$, so

$$w(y) - w(x) = \sum_{i=0}^{n-1} (i+1)(y_i - x_i) = \sum_{i=0}^{n-1} i\,(y_i - x_i).$$

Let $a$ be the smallest index where $x_a \neq y_a$ and let $b$ be the largest such index, so

$$w(y) - w(x) = \sum_{i=a}^{b} i\,(y_i - x_i).$$

Because $d_L(x, y) = 1$, $x$ and $y$ have a common substring $z$ of length $n-1$. Either $z = x_{[n]\setminus a} = y_{[n]\setminus b}$ or $z = x_{[n]\setminus b} = y_{[n]\setminus a}$. Without loss of generality assume the latter. Then for $a \le i \le b-1$, $z_i = x_i = y_{i+1}$. Because $H(x) = H(y) = k$, we have $x_b = y_a$. Thus

$$w(y) - w(x) = a\,y_a - b\,x_b + \sum_{i=a}^{b-1} \big((i+1)\,y_{i+1} - i\,x_i\big) = (a-b)\,x_b + \sum_{i=a}^{b-1} x_i.$$

Let $l = \sum_{i=a}^{b-1} x_i$, the number of ones in $x_{\{a..b-1\}}$. There are two cases to consider, $x_b = 0$ and $x_b = 1$. If $x_b = 0$, then $w(y) - w(x) = l$. Since $x \neq y$, $x_a = 1$ and $0 < l \le k$. If $x_b = 1$, then $w(x) - w(y) = b - a - l$, the number of zeros in $x_{\{a..b-1\}}$. Since $x \neq y$, $x_a = 0$ and $0 < b - a - l \le n - k$. In both cases, $0 < |w(y) - w(x)| \le \max(k, n-k)$, so $(w(y) - w(x)) \bmod (\max(k, n-k)+1) \neq 0$ and therefore $f_k(x) \neq f_k(y)$. ■
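Lemma 8 can be checked exhaustively for small parameters. The sketch below is only a sanity check under stated assumptions: two distinct strings of equal length are treated as adjacent when they share a string obtained by one deletion, and all helper names are hypothetical.

```python
from itertools import combinations


def deletions(x):
    """All strings obtained from x by deleting exactly one symbol."""
    return {x[:i] + x[i + 1:] for i in range(len(x))}


def modified_vt_weight(x):
    n, k = len(x), sum(x)
    return sum((i + 1) * b for i, b in enumerate(x)) % (max(k, n - k) + 1)


def constant_weight_strings(n, k):
    """All bit tuples of length n with exactly k ones."""
    for ones in combinations(range(n), k):
        yield tuple(1 if i in ones else 0 for i in range(n))


def coloring_is_proper(n, k):
    """True if no two adjacent weight-k strings share a modified VT weight."""
    vertices = list(constant_weight_strings(n, k))
    for x, y in combinations(vertices, 2):
        adjacent = bool(deletions(x) & deletions(y))
        if adjacent and modified_vt_weight(x) == modified_vt_weight(y):
            return False
    return True


all_proper = all(coloring_is_proper(n, k) for n in range(1, 9) for k in range(n + 1))
print(all_proper)
```

The check is quadratic in the number of vertices per layer, so it is only practical for short strings; it is meant to illustrate the statement, not to replace the proof.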
B. Lower bounds on coloring single deletion graphs

We show that both the VT coloring of $L_{1,n}$ and our coloring of $L_{1,n,k}$ are optimal. In both cases, to demonstrate optimality, we will find cliques of matching size. Recall that the vertices in a clique must each be assigned different colors, so $\omega(G) \le \chi(G)$. The following lemma constructs these cliques.

Lemma 9. For all $s, n \in \mathbb{N}$, $s \le n$, $L_{s,n}$ contains cliques of $I_{s,n}$ vertices. For all $r, s, k, n \in \mathbb{N}$ such that $0 \le r \le s$ and $r \le k \le n - s + r$, $L_{s,n,k}$ contains cliques of $I_{(s,r),(n,k)}$ vertices.

Proof: By (O, each string in $[2]^{n-s}$ has $I_{s,n}$ superstrings in $[2]^n$. These are all adjacent in $L_{s,n}$, so they form a clique. By Lemma 5, each string in $[2]^{n-s}_{k-r}$ has $I_{(s,r),(n,k)}$ superstrings in $[2]^n_k$. These are all adjacent in $L_{s,n,k}$, so they form a clique. ■

The optimality of both colorings follows immediately.

Theorem 5. For all $n$, the VT coloring of $L_{1,n}$ is optimal and

$$\chi(L_{1,n}) = \omega(L_{1,n}) = n + 1.$$

For all $n$ and $1 \le k \le n-1$, the coloring of $L_{1,n,k}$ by the modified VT weight is optimal and

$$\chi(L_{1,n,k}) = \omega(L_{1,n,k}) = \max(k, n-k) + 1.$$

Proof: By Lemma 9, $L_{1,n}$ contains cliques of $I_{1,n} = n+1$ vertices. The VT coloring uses $n+1$ colors, so

$$n + 1 \le \omega(L_{1,n}) \le \chi(L_{1,n}) \le n + 1.$$

By Lemma 9, $L_{1,n,k}$ contains cliques of sizes $I_{(1,0),(n,k)} = k+1$ and $I_{(1,1),(n,k)} = n-k+1$. From Lemma 8 we have

$$\max(k, n-k) + 1 \le \omega(L_{1,n,k}) \le \chi(L_{1,n,k}) \le \max(k, n-k) + 1.$$ ■
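The matching clique in the proof of Theorem 5 can be built explicitly. The sketch below takes one string of length $n-1$, forms all of its single-insertion superstrings of weight $k$, and checks that the modified VT weight gives each of them a different color. The choice of base string and the helper names are assumptions made for illustration.

```python
def modified_vt_weight(x):
    n, k = len(x), sum(x)
    return sum((i + 1) * b for i, b in enumerate(x)) % (max(k, n - k) + 1)


def single_insertions(z, bit):
    """All distinct strings obtained by inserting `bit` once into z."""
    return {z[:i] + (bit,) + z[i:] for i in range(len(z) + 1)}


def largest_substring_clique(n, k):
    """Superstrings in [2]^n_k of one length-(n-1) string, for 1 <= k <= n-1."""
    if k >= n - k:
        # insert a zero into a weight-k string: k + 1 distinct superstrings
        z, bit = (1,) * k + (0,) * (n - 1 - k), 0
    else:
        # insert a one into a weight-(k-1) string: n - k + 1 distinct superstrings
        z, bit = (1,) * (k - 1) + (0,) * (n - k), 1
    return single_insertions(z, bit)


clique = largest_substring_clique(7, 3)
colors = {modified_vt_weight(x) for x in clique}
print(len(clique), sorted(colors))  # 5 [0, 1, 2, 3, 4]
```

For $n = 7$ and $k = 3$ the clique has $\max(k, n-k) + 1 = 5$ vertices and uses all five colors, which is the tightness that the theorem asserts.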

C. Asymptotic optimality of our codes 

We now show that taking the union of independent sets from $L_{1,n,k}$ produces an independent set in $L_{1,n}$ that is asymptotically of optimal size. Let $C_{n,k}$ be a largest color class of $L_{1,n,k}$ using the coloring described above. For $a \in [2]$, our code is the set

$$D_{n,a} := \bigcup_{\substack{0 \le k \le n \\ k \equiv a \bmod 2}} C_{n,k}.$$

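Lemma 10. $|D_{n,a}| \ge \frac{1}{n+1}\left(2^n - \binom{n}{k^*}\right)$, where $k^*$ is the integer closest to $n/2$ such that $k^* \not\equiv a \bmod 2$.

Proof: In each graph $L_{1,n,k}$, some color class must be at least as large as the average, so

$$|D_{n,a}| = \sum_{\substack{0 \le k \le n \\ k \equiv a \bmod 2}} |C_{n,k}| \ge \sum_{\substack{0 \le k \le n \\ k \equiv a \bmod 2}} \frac{|V(L_{1,n,k})|}{\chi(L_{1,n,k})}.$$

There are $\binom{n}{k}$ vertices in $L_{1,n,k}$ and from Lemma 8 we have $\chi(L_{1,n,k}) \le \max(k, n-k) + 1$. Thus $|D_{n,a}|$ is at least

$$\sum_{\substack{0 \le k \le n \\ k \equiv a \bmod 2}} \binom{n}{k} \frac{1}{\max(k, n-k)+1} = \sum_{\substack{k=0 \\ k \equiv a \bmod 2}}^{k^*-1} \binom{n}{k} \frac{1}{n-k+1} + \sum_{\substack{k=k^*+1 \\ k \equiv a \bmod 2}}^{n} \binom{n}{k} \frac{1}{k+1}.$$

Since $\binom{n}{k}\frac{1}{n-k+1} = \frac{1}{n+1}\binom{n+1}{k}$ and $\binom{n}{k}\frac{1}{k+1} = \frac{1}{n+1}\binom{n+1}{k+1}$, and because $\binom{n+1}{k} = \binom{n}{k-1} + \binom{n}{k}$, we can rewrite the lower bound as

$$\frac{1}{n+1}\left(\sum_{k=0}^{k^*-1} \binom{n}{k} + \sum_{k=k^*+1}^{n} \binom{n}{k}\right) = \frac{1}{n+1}\left(2^n - \binom{n}{k^*}\right).$$ ■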
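The construction of $D_{n,a}$ is easy to carry out for small $n$. The sketch below builds the code by keeping a largest color class of the modified VT weight in every layer with $k \equiv a \bmod 2$, checks that the result corrects a single deletion, and compares its size with the bound of Lemma 10, read here as $\frac{1}{n+1}\left(2^n - \binom{n}{k^*}\right)$. Function names are hypothetical, and the exhaustive enumeration is an assumption of convenience rather than the intended way to use the code.

```python
from collections import defaultdict
from itertools import product
from math import comb


def deletions(x):
    return {x[:i] + x[i + 1:] for i in range(len(x))}


def build_code(n, a):
    """Union over k = a (mod 2) of a largest modified-VT color class of weight k."""
    classes = defaultdict(list)
    for x in product((0, 1), repeat=n):
        k = sum(x)
        if k % 2 == a:
            color = sum((i + 1) * b for i, b in enumerate(x)) % (max(k, n - k) + 1)
            classes[(k, color)].append(x)
    code = []
    for k in range(a, n + 1, 2):
        code += max((c for (kk, _), c in classes.items() if kk == k), key=len)
    return code


def corrects_one_deletion(code):
    """True if the single-deletion sets of distinct codewords are pairwise disjoint."""
    seen = set()
    for x in code:
        d = deletions(x)
        if d & seen:
            return False
        seen |= d
    return True


def lemma10_bound(n, a):
    # k* is the integer closest to n/2 whose parity differs from a
    k_star = min((k for k in range(n + 1) if k % 2 != a), key=lambda k: abs(2 * k - n))
    return (2 ** n - comb(n, k_star)) / (n + 1)


for n in (4, 8, 10):
    code = build_code(n, 0)
    print(n, len(code), corrects_one_deletion(code), len(code) >= lemma10_bound(n, 0))
```

Codewords taken from different layers differ in weight by at least two, so their single-deletion sets cannot meet; within a layer, disjointness is what Lemma 8 provides.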

Theorem 6. The sequence of codes $D_{n,a}$ is asymptotically optimal.
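Proof: By Stirling's formula,

$$\binom{n}{k^*} \sim 2^n \sqrt{\frac{2}{\pi n}},$$

so by Lemma 10, $|D_{n,a}| \gtrsim \frac{2^n}{n+1}\left(1 - \sqrt{\frac{2}{\pi n}}\right) \sim \frac{2^n}{n}$. Recall from Section II-D that $\alpha(L_{1,n}) \sim \frac{2^n}{n}$, so the code is asymptotically optimal. ■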
Note that $\max_k \chi(L_{1,n,k}) = n$, so the colorings constructed by Lemma 3 use $2n$ colors. They are far from optimal because $\chi(L_{1,n}) = n + 1$. However, most of the vertices are in the subgraphs with $k \approx n/2$, and $\chi(L_{1,n,n/2}) = n/2 + 1$. Thus, half the vertices have been thrown out, but the middle layers are colored about twice as efficiently as they were in the original graph. There are $2n$ color classes, but their sizes vary significantly and only $n + 2$ of them contain most of the vertices. This explains the asymptotic optimality.

V. Lower bounds on coloring $L_{s,n}$

In this section, we will show that for correcting multiple deletions, the independent sets guaranteed by an asymptotically 
optimal coloring do not match the Levenshtein upper bound. This means that either solving the coloring problem does not 
guarantee a solution to the independent set problem or the Levenshtein upper bound is not tight. 

More concretely, we show that $\chi(L_{s,n}) \gtrsim \binom{n}{s}\binom{s}{\lfloor s/2 \rfloor}$, whereas for the average size of the color classes to match Levenshtein's upper bound, we need $\chi(L_{s,n}) \sim \binom{n}{s}$.

In Section II we gave two lower bounds on chromatic number for any graph $G$. First, $\chi(G) \ge |V(G)|/\alpha(G)$. Levenshtein's asymptotic upper bound is $\alpha(L_{s,n}) \lesssim 2^n/\binom{n}{s}$ [2]. Combining these yields $\chi(L_{s,n}) \gtrsim \binom{n}{s}$. Second, $\chi(G) \ge \omega(G)$. From Lemma 9, we know that cliques in $L_{s,n}$ produced by a single common substring contain $I_{s,n}$ vertices and $I_{s,n} \sim \binom{n}{s}$. Again we get $\chi(L_{s,n}) \gtrsim \binom{n}{s}$.

In general, if the first bound is tight ($\alpha(L_{s,n})\,\chi(L_{s,n}) \sim |V(L_{s,n})|$), then solving the coloring problem leads to many asymptotically optimal codes. For any asymptotically optimal sequence of colorings, almost all sequences of color classes are asymptotically optimal sequences of independent sets. In the single deletion case, the first bound is tight, and consequently one might hope that the same is true for all $s$. However, for multiple deletions we improve the second lower bound on chromatic number by showing that $\omega(L_{s,n}) \gtrsim \binom{n}{s}\binom{s}{\lfloor s/2 \rfloor}$.

Consequently, average sized color classes in an optimal coloring of $L_{s,n}$ do not meet Levenshtein's upper bound on $\alpha(L_{s,n})$. If we only know the chromatic number of $L_{s,n}$, we can only guarantee the existence of color classes of the average size, $2^n/\chi(L_{s,n})$ vertices. It is possible that there are optimal colorings in which the size of the largest color class is much larger
than the average size. It is also possible that average sized color classes in an optimal coloring are asymptotically optimal 
independent sets because the Levenshtein upper bound is not tight. 



A. Large cliques and high degree vertices 

To improve the lower bound on the chromatic number of $L_{s,n}$, we need to find large cliques in $L_{s,n}$. That is the goal of this section.

In $L_{1,n}$, cliques produced by a single common substring are maximum, but in $L_{s,n}$ for $s \ge 2$ a more general construction produces larger cliques. For any string $x$ of length $m$, consider all of the strings of length $n$ within a deletion distance of $s$. By the triangle inequality, the deletion distance between any two of these strings is at most $2s$, so they form a clique in $L_{s,n}$. If we let $m = n - s$, then every string in the clique has $x$ as a substring. If we let $m = n + s$, then every string in the clique has $x$ as a superstring. Bigger cliques can be constructed by letting $m$ be closer to $n$. Recall from Section II-C that $\omega(G^2) \ge \Delta(G) + 1$ for any graph $G$ because the neighborhood of any vertex in $G$ is a clique in $G^2$. When $s$ is even, we can let $m = n$. In this case, $x$ is also a vertex in $L_{s,n}$ and we are effectively applying the bound.

Lemma 11. For any strings $x, x', y, y'$, not necessarily of the same length, $d_L(xx', yy') \le d_L(x, y) + d_L(x', y')$.

Proof: Let $|x|$ denote the length of $x$. The strings $x$ and $y$ have a common substring $z$ of length $(|x| + |y| - d_L(x,y))/2$ and $x'$ and $y'$ have a common substring $z'$ of length $(|x'| + |y'| - d_L(x', y'))/2$. The string $zz'$ is a common substring of $xx'$ and $yy'$, so the claimed bound holds. ■

Lemma 12. For all $n \in \mathbb{N}$, the maximum clique size in $L_{s,n}$ satisfies $\omega(L_{s,n}) \gtrsim \binom{n}{s}\binom{s}{\lfloor s/2 \rfloor}$ and the maximum degree in $L_{s,n}$ satisfies $\Delta(L_{s,n}) \sim \binom{n}{s}^2$.

Proof: For all $b, c, k, l \in \mathbb{N}$ with $b + c \le k$, let $m = k(l+3) - 3$ and $n = m + b - c$. We will construct a string $x \in [2]^m$ and a set $S \subseteq [2]^n$ such that $|S| = \binom{k}{b,\,c,\,k-b-c}\, l^b (l-2)^c$ and for all $y \in S$, $d_L(x, y) \le b + c$. We will specify each of these strings by their pattern of runs. All of these strings have the same first bit. All contain $k$ segments separated by runs of length 3, so the separator differs from the last bit of the previous segment as well as the first bit of the next segment.

We will use three types of segments, types A, B and C. Segments of type A have length $l$ and consist of runs of length 1. Segments of type B have total length $l + 1$ and contain one run of length 2 and $l - 1$ runs of length 1. There are $l$ possible run patterns with this distribution. Each segment of type B is a superstring of the segment of type A. Segments of type C have total length $l - 1$ and contain one run of length 2 and $l - 3$ runs of length 1. There are $l - 2$ possible run patterns with this distribution. Each segment of type C is a substring of the segment of type A.

Fig. 3: Two of the strings constructed in the proof of Lemma 12: the center string $x$, which has segments of types AAA, and a string $y \in S$, which has segments of types CAB. The parameters are $l = 6$, $k = 3$, and $b = c = 1$.

In $x$, all $k$ segments are of type A. In an element of $S$, there are $k - b - c$ segments of type A, $b$ of type B, and $c$ of type C. Thus there are $\binom{k}{b,\,c,\,k-b-c}$ possible sequences of the types and $\binom{k}{b,\,c,\,k-b-c}\, l^b (l-2)^c$ elements of $S$. Fig. 3 gives an example.

Now we need to show that for all $y \in S$, $d_L(x, y) \le b + c$. The number of runs within a segment is always $l$ or $l - 2$, so the boundary runs of length 3 have the same compositions in $x$ and all elements of $S$. In any $y \in S$, there are $b + c$ segments that differ from $x$. In each case, the deletion distance between the segment in $x$ and the segment in $y$ is one. The rest of the strings match exactly. By Lemma 11, $d_L(x, y) \le b + c$.
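The run-pattern construction in this proof can be made concrete. The sketch below builds the center string $x$ and the set $S$ from lists of run lengths and checks both the count $\binom{k}{b,\,c,\,k-b-c}\, l^b (l-2)^c$ and the distance bound. It assumes that the first bit is 0 and that $d_L(x,y) = |x| + |y| - 2\,\mathrm{LCS}(x,y)$, the reading of deletion distance implied by the proof of Lemma 11; all function names are hypothetical.

```python
from itertools import combinations, product
from math import factorial


def from_runs(runs, first_bit=0):
    """Expand a list of run lengths into a bit tuple, alternating bits per run."""
    bits, bit = [], first_bit
    for length in runs:
        bits.extend([bit] * length)
        bit ^= 1
    return tuple(bits)


def segment_patterns(kind, l):
    """Run-length patterns for a segment of type 'A', 'B', or 'C' (needs l >= 3)."""
    if kind == "A":
        return [[1] * l]
    if kind == "B":  # l runs, one of length 2: total length l + 1
        return [[1] * i + [2] + [1] * (l - 1 - i) for i in range(l)]
    # type C: l - 2 runs, one of length 2: total length l - 1
    return [[1] * i + [2] + [1] * (l - 3 - i) for i in range(l - 2)]


def assemble(segments):
    """Join segments with separator runs of length 3."""
    runs = []
    for j, seg in enumerate(segments):
        runs += seg + ([3] if j < len(segments) - 1 else [])
    return from_runs(runs)


def build_x_and_S(k, l, b, c):
    x = assemble([[1] * l] * k)
    S = set()
    for b_pos in combinations(range(k), b):
        others = [i for i in range(k) if i not in b_pos]
        for c_pos in combinations(others, c):
            kinds = ["B" if i in b_pos else "C" if i in c_pos else "A" for i in range(k)]
            for segs in product(*(segment_patterns(t, l) for t in kinds)):
                S.add(assemble(list(segs)))
    return x, S


def deletion_distance(x, y):
    """|x| + |y| - 2 * LCS(x, y), computed with a rolling LCS table."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0]
        for j, yj in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if xi == yj else max(prev[j], cur[j - 1]))
        prev = cur
    return len(x) + len(y) - 2 * prev[-1]


x, S = build_x_and_S(k=3, l=6, b=1, c=1)  # the parameters of Fig. 3
print(len(x), len(S), max(deletion_distance(x, y) for y in S))
```

With the Fig. 3 parameters the sketch produces a string $x$ of length 24 and $3! \cdot 6 \cdot 4 = 144$ elements of $S$, each within distance $b + c = 2$ of $x$.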
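Taking $k \sim l \sim \sqrt{n}$ yields

$$|S| \sim \frac{k^{b+c}}{b!\,c!}\, l^{b+c} \sim \frac{n^{b+c}}{b!\,c!} \sim \binom{n}{b}\binom{n}{c}.$$

By the triangle inequality, for all $y, z \in S$, $d_L(y, z) \le 2(b + c)$. Thus the vertices in $S$ form a clique in $L_{b+c,n}$. To maximize the size of this clique for a given $s$, let $b = \lfloor s/2 \rfloor$ and $c = \lceil s/2 \rceil$. Then

$$\omega(L_{s,n}) \gtrsim \binom{n}{\lfloor s/2 \rfloor}\binom{n}{\lceil s/2 \rceil} \sim \binom{n}{s}\binom{s}{\lfloor s/2 \rfloor}.$$

If we let $b = c = s$, then $x$ and all elements of $S$ are the same length. Thus $x$ is a vertex in $L_{s,n}$ and $S$ is a subset of its neighborhood. The degree of $x$ in $L_{s,n}$ is at least $|S|$, so

$$\Delta(L_{s,n}) \gtrsim \binom{n}{s}^2.$$

From Lemma 4 we have $\Delta(L_{s,n}) \lesssim \binom{n}{s}^2$.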

Corollary 2. For all $s, n \in \mathbb{N}$ with $s \le n$, the chromatic number of $L_{s,n}$ satisfies $\chi(L_{s,n}) \gtrsim \binom{n}{s}\binom{s}{\lfloor s/2 \rfloor}$.

Proof: This follows from the basic inequality $\chi(L_{s,n}) \ge \omega(L_{s,n})$.

This leads us to the main theorem of this section.

Theorem 7. For all fixed $s \in \mathbb{N}$ with $s \ge 2$, the following inequalities hold but at most one is tight:

$$\frac{2^n}{\chi(L_{s,n})} \le \alpha(L_{s,n}) \lesssim \frac{2^n}{\binom{n}{s}}.$$

Proof: The lower bound follows from $\alpha(G)\,\chi(G) \ge |V(G)|$. The upper bound is Levenshtein's. From Corollary 2,

$$\frac{2^n}{\chi(L_{s,n})} \lesssim \frac{2^n}{\binom{n}{s}\binom{s}{\lfloor s/2 \rfloor}}.$$

For $s \ge 2$, $\binom{s}{\lfloor s/2 \rfloor} \ge 2$.

Thus knowing the asymptotic behavior of $\chi(L_{s,n})$ does not give the asymptotic behavior of $\alpha(L_{s,n})$.



VI. Conclusion 

We investigated two approaches to code construction. We showed that a two-stage approach that restricts the weight of codewords trades a small penalty in guaranteed code size for a large reduction in computational complexity of construction. This approach produces a new single deletion correcting code that is asymptotically optimal.

The second approach that we investigated is code construction via graph coloring. The VT codes are an optimal coloring of the whole single deletion graph and our new code is built from optimal colorings of the constant weight single deletion graphs. We showed that for multiple deletions, the best possible colorings are not guaranteed to produce codes meeting the Levenshtein upper bound. If a coloring contains a color class that meets this upper bound, that class must be much larger than the average size of the classes in the coloring.
Appendix A
Counting Superstrings 

Let $x \in [2]^n$ and $y \in I_s(x)$ and consider a specific set of $s$ insertions that create $y$ from $x$. Suppose $b$ is a new symbol and it is inserted immediately before $x_i$. If $b = x_i$, then we can produce the same superstring by instead inserting $b$ immediately after $x_i$. Consequently, to produce any superstring it is sufficient to use only two types of insertions: insertions of the complement of $x_i$ before $x_i$, and arbitrary insertions at the end of $x$. We would like to keep track of how many of our insertions are ones and how many are zeros. New zeros can be inserted before existing ones in $x$ and at the end of $x$. New ones can be inserted before existing zeros in $x$ and at the end of $x$.

We can make these ideas precise with the following bijection. 

Lemma 13. For each $x \in [2]^n_k$, there is a bijection between $I_{(s,r)}(x)$ and

$$\bigcup_{a=0}^{s-r} \bigcup_{b=0}^{r} \left([2]^{n-k+b-1}_{b} \times [1]\right) \times \left([2]^{k+a-1}_{a} \times [1]\right) \times [2]^{s-a-b}_{r-b}.$$

Proof: We will refer to the latter set as encodings of the insertions that produce $y$ from $x$ and denote the set as $J_{(s,r),(n+s,k+r)}$. We will describe the bijection explicitly as an encoding function from $I_{(s,r)}(x)$ to $J_{(s,r),(n+s,k+r)}$ and an inverse decoding function.

To describe these algorithms, we need a few simple string operations. If a string is nonempty, it has a head that is a symbol and a tail that is another string. We write the empty string as $\epsilon$. We use a colon to indicate string concatenation.

First, we describe the encoding function.



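Algorithm 1 Encoding $y \in I_{(s,r)}(x)$ as $z \in J_{(s,r),(n+s,k+r)}$

procedure Encode($x$, $y$)
    $(z_0, z_1) \leftarrow (\epsilon, \epsilon)$
    while $x \neq \epsilon$ do
        $(u, x) \leftarrow (\mathrm{Head}(x), \mathrm{Tail}(x))$
        $(v, y) \leftarrow (\mathrm{Head}(y), \mathrm{Tail}(y))$
        while $v \neq u$ do
            $z_u \leftarrow z_u : 1$
            $(v, y) \leftarrow (\mathrm{Head}(y), \mathrm{Tail}(y))$
        end while
        $z_u \leftarrow z_u : 0$
    end while
    $z_2 \leftarrow y$
    return $z$
end procedure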



Encode consumes symbols from $y$ until it finds one that matches the head of $x$. It adds a one to the output for each mismatch and adds a zero when it finally finds a match. Which output it uses depends on the current head of $x$. When $x$ runs out of symbols, any remaining symbols in $y$ become the third part of the output.

The first term of the product, $z_0$, specifies how many new ones to insert before each existing zero. The number of zeros in $z_0$ is equal to the number of zeros in $x$ and the last symbol of $z_0$ is always zero. The total number of ones inserted this way is $b$, so $z_0 \in ([2]^{n-k+b-1}_{b} \times [1])$ for some $b$. The second term of the product, $z_1$, specifies how many new zeros to insert before each existing one. The total number of zeros inserted this way is $a$, so $z_1 \in ([2]^{k+a-1}_{a} \times [1])$ for some $a$. The third term, $z_2$, specifies the insertions at the end of the string. There must be $s - r - a$ zeros and $r - b$ ones inserted there, so $z_2 \in [2]^{s-a-b}_{r-b}$.

Now we describe the decoding function: 
Fig. 4: Table 4(a) illustrates the computation of Encode($x$, $y$) and table 4(b) illustrates the computation of Decode($x$, $z$). In each case, there is a row for each iteration of the outer while loop.



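Algorithm 2 Decoding $y \in I_{(s,r)}(x)$ from $z \in J_{(s,r),(n+s,k+r)}$

procedure Decode($x$, $z$)
    $y \leftarrow \epsilon$
    while $x \neq \epsilon$ do
        $(u, x) \leftarrow (\mathrm{Head}(x), \mathrm{Tail}(x))$
        $(w, z_u) \leftarrow (\mathrm{Head}(z_u), \mathrm{Tail}(z_u))$
        while $w = 1$ do
            $y \leftarrow y : \bar{u}$
            $(w, z_u) \leftarrow (\mathrm{Head}(z_u), \mathrm{Tail}(z_u))$
        end while
        $y \leftarrow y : u$
    end while
    $y \leftarrow y : z_2$
    return $y$
end procedure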



The head of $x$ determines whether Decode inspects $z_0$ or $z_1$. Decode adds the complement of the head of $x$ to the output for each one in $z_u$. When it finds a zero, it outputs the head of $x$ and advances. When it reaches the end of $x$, it adds $z_2$ to the output.

It is easy to see that if $y \in I_{(s,r)}(x)$ and Encode($x$, $y$) $= z$, then Decode($x$, $z$) $= y$. Fig. 4 illustrates an example execution of each algorithm.
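Algorithms 1 and 2 translate almost line for line into Python. The sketch below is one possible transcription under stated assumptions: strings are Python `str` values over the characters "0" and "1", the triple $(z_0, z_1, z_2)$ is returned as a tuple, and `all_superstrings` is a hypothetical helper added only to exercise the round trip.

```python
def encode(x, y):
    """Algorithm 1: x is a bit string, y a superstring of x. Returns (z0, z1, z2)."""
    z = {"0": "", "1": ""}
    pos = 0
    for u in x:
        while y[pos] != u:   # a mismatch is an inserted complement of u
            z[u] += "1"
            pos += 1
        z[u] += "0"          # matched the head of x
        pos += 1
    return z["0"], z["1"], y[pos:]


def decode(x, z):
    """Algorithm 2: rebuild y from x and z = (z0, z1, z2)."""
    z0, z1, z2 = z
    streams = {"0": iter(z0), "1": iter(z1)}
    y = []
    for u in x:
        complement = "1" if u == "0" else "0"
        for w in streams[u]:  # consume z_u up to and including its next zero
            if w == "0":
                break
            y.append(complement)
        y.append(u)
    return "".join(y) + z2


def all_superstrings(x, s):
    """Every string reachable from x by s single-symbol insertions."""
    current = {x}
    for _ in range(s):
        current = {y[:i] + b + y[i:] for y in current
                   for i in range(len(y) + 1) for b in "01"}
    return current


x, y = "0110", "1001101"
z = encode(x, y)
print(z, decode(x, z) == y)  # ('100', '100', '1') True
```

Because each superstring yields a distinct triple and decoding inverts encoding, counting the triples counts the superstrings, which is how the bijection is used in the lemma that follows.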



We will use a few well-known combinatorial results in the following lemma and the lemma in the next appendix. Recall that Vandermonde's identity is

$$\sum_{i=0}^{c} \binom{a}{i}\binom{b}{c-i} = \binom{a+b}{c}. \qquad (9)$$

The bijection corresponding to this identity splits a string of length $a + b$ into a string of length $a$ and a string of length $b$. The sum is over all possible distributions of the $c$ ones in the original string between the new strings.

The number of multisets with $n$ possible unique elements and $k$ elements is $\binom{n+k-1}{k}$. Such a multiset can be represented as a string of $n - 1$ zeros and $k$ ones. Each one corresponds to an element and the zeros mark the boundaries between different types of elements. A version of Vandermonde's identity related to multiset counting is

$$\sum_{i=0}^{c} \binom{a+i-1}{i}\binom{b+c-i-1}{c-i} = \binom{a+b+c-1}{c}. \qquad (10)$$

This decomposes a multiset with $a + b$ possible unique elements into a multiset with $a$ possible unique elements and a multiset with $b$ possible unique elements. This also corresponds to breaking a string with $a + b - 1$ zeros at the location of its $a$th zero, so $a - 1$ zeros are in the first fragment and $b - 1$ are in the second fragment.
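Both identities are easy to spot-check numerically. The sketch below verifies Vandermonde's identity and a multiset analogue for small parameters. The multiset form used here, $\sum_i \left(\!\binom{a}{i}\!\right)\left(\!\binom{b}{c-i}\!\right) = \left(\!\binom{a+b}{c}\!\right)$ with $\left(\!\binom{n}{k}\!\right) = \binom{n+k-1}{k}$, is the standard statement and is an assumption about the exact form intended in the text; the function names are hypothetical.

```python
from math import comb


def multichoose(n, k):
    """Number of multisets of size k drawn from n >= 1 element types."""
    return comb(n + k - 1, k)


def vandermonde_holds(a, b, c):
    return sum(comb(a, i) * comb(b, c - i) for i in range(c + 1)) == comb(a + b, c)


def multiset_vandermonde_holds(a, b, c):
    lhs = sum(multichoose(a, i) * multichoose(b, c - i) for i in range(c + 1))
    return lhs == multichoose(a + b, c)


ranges = [(a, b, c) for a in range(1, 7) for b in range(1, 7) for c in range(7)]
print(all(vandermonde_holds(*t) for t in ranges),
      all(multiset_vandermonde_holds(*t) for t in ranges))
```

The check relies on `math.comb` returning 0 when the lower index exceeds the upper one, which matches the convention that out-of-range binomial coefficients vanish.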

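Lemma 5. For all $n, k, s, r \in \mathbb{N}$ with $0 \le r \le s \le n$ and $0 \le k \le n$, and all $x \in [2]^{n-s}_{k-r}$, the number of superstrings of $x$ with length $n$ and weight $k$ satisfies

$$|I_{(s,r)}(x)| = \sum_{i=0}^{\min(r,\,s-r)} \binom{k+s-2r}{s-r-i}\binom{n-k-s+2r}{r-i}.$$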
Proof: From Lemma 13, for all $x \in [2]^n_k$, $|I_{(s,r)}(x)| = |J_{(s,r),(n+s,k+r)}|$. This value is

$$\sum_{a=0}^{s-r} \sum_{b=0}^{r} \binom{n-k+b-1}{b}\binom{k+a-1}{a}\binom{s-a-b}{r-b}.$$

From Vandermonde's identity, (9), we have

$$\binom{s-a-b}{r-b} = \sum_{c=0}^{r-b} \binom{s-r-a}{c}\binom{r-b}{r-b-c} = \sum_{c=0}^{\min(r,\,s-r)} \binom{s-r-a}{s-r-a-c}\binom{r-b}{r-b-c}.$$

Substituting this into the expression for $|J_{(s,r),(n+s,k+r)}|$ and exchanging the order of the sums yields

$$\sum_{c=0}^{\min(r,\,s-r)} \left(\sum_{a=0}^{s-r} \binom{k+a-1}{a}\binom{s-r-a}{s-r-a-c}\right)\left(\sum_{b=0}^{r} \binom{n-k+b-1}{b}\binom{r-b}{r-b-c}\right).$$

The multiset variant of Vandermonde's identity, (10), eliminates the sums over $a$ and $b$ giving

$$\sum_{c=0}^{\min(r,\,s-r)} \binom{k+s-r}{s-r-c}\binom{n-k+r}{r-c}.$$

Substituting $n - s$ for $n$ and $k - r$ for $k$ yields the claimed result.
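The count in Lemma 5 can be compared against brute-force enumeration for short strings. The sketch below reads the formula as $\sum_{c} \binom{k+s-2r}{s-r-c}\binom{n-k-s+2r}{r-c}$ for $x \in [2]^{n-s}_{k-r}$, which is this example's interpretation of the statement; the function names are hypothetical.

```python
from itertools import product
from math import comb


def superstrings(x, s):
    """All bit tuples obtained from x by s single-symbol insertions."""
    current = {x}
    for _ in range(s):
        current = {y[:i] + (b,) + y[i:] for y in current
                   for i in range(len(y) + 1) for b in (0, 1)}
    return current


def count_by_brute_force(x, s, r):
    """Superstrings of x that use s insertions, exactly r of them ones."""
    target_weight = sum(x) + r
    return sum(1 for y in superstrings(x, s) if sum(y) == target_weight)


def count_by_formula(n, k, s, r):
    """Closed form for x of length n - s and weight k - r."""
    return sum(comb(k + s - 2 * r, s - r - c) * comb(n - k - s + 2 * r, r - c)
               for c in range(min(r, s - r) + 1))


mismatches = [
    (x, s, r)
    for length in range(5)
    for x in product((0, 1), repeat=length)
    for s in range(1, 4)
    for r in range(s + 1)
    if count_by_brute_force(x, s, r) != count_by_formula(length + s, sum(x) + r, s, r)
]
print(mismatches)  # []
```

As a worked instance, $x = 01$ with $s = 2$ and $r = 1$ gives $n = 4$, $k = 2$, and the formula evaluates to $\binom{2}{1}\binom{2}{1} + \binom{2}{0}\binom{2}{0} = 5$, matching the five weight-2 strings of length 4 that contain $01$ as a subsequence.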

Appendix B 
Proof of Lemma 6

Lemma 6. For all $s \in \mathbb{N}$,

$$f_s(p) = \sum_{0 \le r \le s} \binom{s}{r}^2 p^{s-r}(1-p)^r$$

is maximized at $p = 1/2$, so for all $p$, $f_s(p) \le 2^{-s}\binom{2s}{s}$.

Proof: To obtain the required upper bound, we express $f_s(p)$ as $\sum_{i \ge 0} a_i\, p^i (1-p)^i$ where the $a_i$ are nonnegative. Starting from Vandermonde's identity, (9), we can derive

$$\binom{s}{r}^2 = \binom{s}{r}\binom{s}{s-r} = \binom{s}{r} \sum_{i} \binom{r}{i}\binom{s-r}{s-r-i} = \sum_{i} \binom{s}{i,\,i,\,r-i,\,s-r-i}.$$

Two of the four parts of the multinomial coefficient involve $r$. Isolating these yields

$$\binom{s}{r}^2 = \sum_{i} \binom{s}{i,\,i,\,s-2i}\binom{s-2i}{r-i}, \qquad (11)$$

which will allow us to perform the desired change of basis. Crucially, the first coefficient does not depend on $r$. Applying (11) to $f_s(p)$ yields

$$f_s(p) = \sum_{i} \binom{s}{i,\,i,\,s-2i}\, p^i (1-p)^i \sum_{r} \binom{s-2i}{r-i}\, p^{s-r-i}(1-p)^{r-i} = \sum_{i} \binom{s}{i,\,i,\,s-2i}\, \big(p(1-p)\big)^i.$$

The binomial theorem reduced the internal sum to 1. Applying $p(1-p) \le 1/4$ yields

$$f_s(p) \le \sum_{i} \binom{s}{i,\,i,\,s-2i}\, 4^{-i} = 2^{-s} \sum_{i} \binom{s}{i,\,i,\,s-2i}\, 2^{s-2i}.$$

We undo the change of basis by expanding $2^{s-2i}$ with the binomial theorem, reordering the sums, and applying (11). Finally Vandermonde's identity eliminates the sum. ■
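A quick numerical check makes the change of basis in this proof tangible. The sketch below evaluates $f_s(p)$ directly and in the form $\sum_i \frac{s!}{i!\,i!\,(s-2i)!}\,(p(1-p))^i$, then compares both with the bound $2^{-s}\binom{2s}{s}$. The function names, the grid of $p$ values, and the floating-point tolerances are all assumptions of this example.

```python
from math import comb, factorial


def f(s, p):
    """f_s(p) = sum_r C(s, r)^2 * p^(s - r) * (1 - p)^r."""
    return sum(comb(s, r) ** 2 * p ** (s - r) * (1 - p) ** r for r in range(s + 1))


def f_changed_basis(s, p):
    """The same function as a polynomial in p(1 - p) with nonnegative coefficients."""
    return sum(
        factorial(s) // (factorial(i) ** 2 * factorial(s - 2 * i)) * (p * (1 - p)) ** i
        for i in range(s // 2 + 1)
    )


def upper_bound(s):
    """2^(-s) * C(2s, s), the value of f_s at p = 1/2."""
    return comb(2 * s, s) / 2 ** s


grid = [j / 20 for j in range(21)]
for s in (1, 2, 5):
    worst = max(f(s, p) for p in grid)
    print(s, round(worst, 6), round(upper_bound(s), 6))
```

For $s = 2$ the function is $1 + 2p(1-p)$, whose maximum of $1.5$ at $p = 1/2$ equals $2^{-2}\binom{4}{2}$.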



Appendix C 
Induced Subgraphs and Graph Perfectness 

A graph $G$ is perfect if and only if for each induced subgraph $H$, $\omega(H) = \chi(H)$ [19]. This is a hereditary property. A graph is perfect if and only if all of its induced subgraphs are perfect.

Lemma 14. $L_{s,n}$ is an induced subgraph of $L_{s,n+1}$.

Proof: Take the vertices of $L_{s,n+1}$ corresponding to the strings that begin with 0. ■

Lemma 15. Let $C_n$ be the cyclic graph with $n$ vertices. For all $n \in \mathbb{N}$ with $n \ge 3$, $C_n$ is an induced subgraph of $L_{s,(n-2)s+1}$.

Proof: We will pick strings $x_i, y, z \in [2]^{(n-2)s+1}$ for $0 \le i \le n-3$. For all $0 \le i \le n-3$, $x_i = 0^{si}\, 1^{s+1}\, 0^{s(n-3-i)}$, $y = 1\,0^{(n-2)s}$ and $z = 0^{(n-2)s}\,1$. Then for $0 \le i \le n-4$, $d_L(x_i, x_{i+1}) = s$, $d_L(x_0, y) = s$, $d_L(x_{n-3}, z) = s$, $d_L(y, z) = 1$, and all other distances are greater than $s$. ■

As an example, for $s = 1$ and $n = 5$ we pick 1100, 0110, 0011, 0001, and 1000.
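The induced cycle of Lemma 15 can be verified mechanically. The sketch below generates the strings $y, x_0, \ldots, x_{n-3}, z$ in cyclic order and checks that two of them are adjacent exactly when they are neighbors on the cycle. It assumes that two distinct strings of length $N$ are adjacent in $L_{s,N}$ when their longest common subsequence has length at least $N - s$; the function names are hypothetical.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0]
        for j, yj in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if xi == yj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def cycle_strings(s, n):
    """The n strings of Lemma 15 in cyclic order: y, x_0, ..., x_{n-3}, z."""
    xs = ["0" * (s * i) + "1" * (s + 1) + "0" * (s * (n - 3 - i)) for i in range(n - 2)]
    y = "1" + "0" * ((n - 2) * s)
    z = "0" * ((n - 2) * s) + "1"
    return [y] + xs + [z]


def adjacent(u, v, s):
    return u != v and lcs_length(u, v) >= len(u) - s


def is_induced_cycle(vertices, s):
    """True if adjacency among the vertices is exactly that of a cycle in list order."""
    n = len(vertices)
    for i in range(n):
        for j in range(i + 1, n):
            on_cycle = (j - i) in (1, n - 1)
            if adjacent(vertices[i], vertices[j], s) != on_cycle:
                return False
    return True


print(cycle_strings(1, 5))  # ['1000', '1100', '0110', '0011', '0001']
print(is_induced_cycle(cycle_strings(1, 5), 1))
```

For $s = 1$ and $n = 5$ this reproduces the five strings listed above, arranged so that consecutive entries (and the last with the first) share a common substring of length 3.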

Theorem 4. For all $s, n \in \mathbb{N}$ with $s \ge 1$ and $n \ge 3s + 1$, $L_{s,n}$ is not a perfect graph.

Proof: By Lemma 14, $L_{s,3s+1}$ is an induced subgraph of $L_{s,n}$. By Lemma 15, the five cycle is an induced subgraph of $L_{s,3s+1}$. Odd cycles with at least five vertices are not perfect because a proper coloring requires three colors even though their largest clique contains only two vertices. ■